sheet music
Towards an AI Musician: Synthesizing Sheet Music Problems for Musical Reasoning
Wang, Zhilin, Yang, Zhe, Luo, Yun, Li, Yafu, Qu, Xiaoye, Qiao, Ziqian, Zhang, Haoran, Zhan, Runzhe, Wong, Derek F., Zhou, Jizhe, Cheng, Yu
Enhancing the ability of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) to interpret sheet music is a crucial step toward building AI musicians. However, current research lacks both evaluation benchmarks and training data for sheet music reasoning. Inspired by mathematics, where simple operations yield infinite verifiable problems, we introduce a novel approach that treats core music theory rules, such as those governing beats and intervals, as programmatic functions to systematically synthesize a vast and diverse corpus of sheet music reasoning problems. This approach allows us to introduce a data synthesis framework that generates verifiable sheet music questions in both textual and visual modalities, leading to the Synthetic Sheet Music Reasoning Benchmark (SSMR-Bench) and a complementary training set. Evaluation results on SSMR-Bench highlight the key role reasoning plays in interpreting sheet music, while also pointing out the ongoing challenges in understanding sheet music in a visual format. By leveraging synthetic data for RL VR, all models show significant improvements on the SSMR-Bench. Additionally, they also demonstrate considerable advancements on previously established human-crafted benchmarks, such as MusicTheoryBench and the music subset of MMMU. Finally, our results show that the enhanced reasoning ability can also facilitate music composition. "The sheet music is the language of musicians." Recent advancements in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have inspired researchers to explore the potential of developing AI musicians (Qu et al., 2025; Bradshaw & Colton, 2025; Wang et al., 2024). Given that sheet music is the universal language of musicians, the ability to read and interpret it is an essential step for AI musicians (Y uan et al., 2024; Wang et al., 2025). As illustrated in Figure 1, sheet music reasoning differs fundamentally from Music Knowledge QA (Li et al., 2024), which evaluates memorized knowledge, and from sheet music recognition (Chen et al., 2025a), which focuses on identifying notation from images.
A document is worth a structured record: Principled inductive bias design for document recognition
Meyer, Benjamin, Tuggener, Lukas, Hänzi, Sascha, Schmid, Daniel, Ayfer, Erdal, Grewe, Benjamin F., Abdulkadir, Ahmed, Stadelmann, Thilo
Many document types use intrinsic, convention-driven structures that serve to encode precise and structured information, such as the conventions governing engineering drawings. However, state-of-the-art approaches treat document recognition as a mere computer vision problem, neglecting these underlying document-type-specific structural properties, making them dependent on sub-optimal heuristic post-processing and rendering many less frequent or more complicated document types inaccessible to modern document recognition. We suggest a novel perspective that frames document recognition as a transcription task from a document to a record. This implies a natural grouping of documents based on the intrinsic structure inherent in their transcription, where related document types can be treated (and learned) similarly. We propose a method to design structure-specific inductive biases for the underlying machine-learned end-to-end document recognition systems, and a respective base transformer architecture that we successfully adapt to different structures. We demonstrate the effectiveness of the so-found inductive biases in extensive experiments with progressively complex record structures from monophonic sheet music, shape drawings, and simplified engineering drawings. By integrating an inductive bias for unrestricted graph structures, we train the first-ever successful end-to-end model to transcribe engineering drawings to their inherently interlinked information. Our approach is relevant to inform the design of document recognition systems for document types that are less well understood than standard OCR, OMR, etc., and serves as a guide to unify the design of future document foundation models.
Unified Cross-modal Translation of Score Images, Symbolic Music, and Performance Audio
Jung, Jongmin, Kim, Dongmin, Lee, Sihun, Cho, Seola, Soh, Hyungjoon, Bukey, Irmak, Donahue, Chris, Jeong, Dasaem
Music exists in various modalities, such as score images, symbolic scores, MIDI, and audio. Translations between each modality are established as core tasks of music information retrieval, such as automatic music transcription (audio-to-MIDI) and optical music recognition (score image to symbolic score). However, most past work on multimodal translation trains specialized models on individual translation tasks. In this paper, we propose a unified approach, where we train a general-purpose model on many translation tasks simultaneously. Two key factors make this unified approach viable: a new large-scale dataset and the tokenization of each modality. Firstly, we propose a new dataset that consists of more than 1,300 hours of paired audio-score image data collected from YouTube videos, which is an order of magnitude larger than any existing music modal translation datasets. Secondly, our unified tokenization framework discretizes score images, audio, MIDI, and MusicXML into a sequence of tokens, enabling a single encoder-decoder Transformer to tackle multiple cross-modal translation as one coherent sequence-to-sequence task. Experimental results confirm that our unified multitask model improves upon single-task baselines in several key areas, notably reducing the symbol error rate for optical music recognition from 24.58% to a state-of-the-art 13.67%, while similarly substantial improvements are observed across the other translation tasks. Notably, our approach achieves the first successful score-image-conditioned audio generation, marking a significant breakthrough in cross-modal music generation.
Source Separation & Automatic Transcription for Music
Derby, Bradford, Dunker, Lucas, Galchar, Samarth, Jarmale, Shashank, Setti, Akash
Source separation is the process of isolating individual sounds in an auditory mixture of multiple sounds [1], and has a variety of applications ranging from speech enhancement and lyric transcription [2] to digital audio production for music. Furthermore, Automatic Music Transcription (AMT) is the process of converting raw music audio into sheet music that musicians can read [3]. Historically, these tasks have faced challenges such as significant audio noise, long training times, and lack of free-use data due to copyright restrictions. However, recent developments in deep learning have brought new promising approaches to building low-distortion stems and generating sheet music from audio signals [4]. Using spectrogram masking, deep neural networks, and the MuseScore API, we attempt to create an end-to-end pipeline that allows for an initial music audio mixture (e.g...wav file) to be separated into instrument stems, converted into MIDI files, and transcribed into sheet music for each component instrument.
Just Label the Repeats for In-The-Wild Audio-to-Score Alignment
Bukey, Irmak, Feffer, Michael, Donahue, Chris
We propose an efficient workflow for high-quality offline alignment of in-the-wild performance audio and corresponding sheet music scans (images). Recent work on audio-to-score alignment extends dynamic time warping (DTW) to be theoretically able to handle jumps in sheet music induced by repeat signs-this method requires no human annotations, but we show that it often yields low-quality alignments. As an alternative, we propose a workflow and interface that allows users to quickly annotate jumps (by clicking on repeat signs), requiring a small amount of human supervision but yielding much higher quality alignments on average. Additionally, we refine audio and score feature representations to improve alignment quality by: (1) integrating measure detection into the score feature representation, and (2) using raw onset prediction probabilities from a music transcription model instead of piano roll. We propose an evaluation protocol for audio-to-score alignment that computes the distance between the estimated and ground truth alignment in units of measures. Under this evaluation, we find that our proposed jump annotation workflow and improved feature representations together improve alignment accuracy by 150% relative to prior work (33% to 82%).
Image2Struct: Benchmarking Structure Extraction for Vision-Language Models
Roberts, Josselin Somerville, Lee, Tony, Wong, Chi Heem, Yasunaga, Michihiro, Mai, Yifan, Liang, Percy
We introduce Image2Struct, a benchmark to evaluate vision-language models (VLMs) on extracting structure from images. Our benchmark 1) captures real-world use cases, 2) is fully automatic and does not require human judgment, and 3) is based on a renewable stream of fresh data. In Image2Struct, VLMs are prompted to generate the underlying structure (e.g., LaTeX code or HTML) from an input image (e.g., webpage screenshot). The structure is then rendered to produce an output image (e.g., rendered webpage), which is compared against the input image to produce a similarity score. This round-trip evaluation allows us to quantitatively evaluate VLMs on tasks with multiple valid structures. We create a pipeline that downloads fresh data from active online communities upon execution and evaluates the VLMs without human intervention. We introduce three domains (Webpages, LaTeX, and Musical Scores) and use five image metrics (pixel similarity, cosine similarity between the Inception vectors, learned perceptual image patch similarity, structural similarity index measure, and earth mover similarity) that allow efficient and automatic comparison between pairs of images. We evaluate Image2Struct on 14 prominent VLMs and find that scores vary widely, indicating that Image2Struct can differentiate between the performances of different VLMs. Additionally, the best score varies considerably across domains (e.g., 0.402 on sheet music vs. 0.830 on LaTeX equations), indicating that Image2Struct contains tasks of varying difficulty. For transparency, we release the full results at https://crfm.stanford.edu/helm/image2struct/v1.0.1/.
Decades' Worth of Musical History Is About to Disappear. You've Probably Heard Nothing About It.
Last month, nothing short of an earthquake-level upheaval struck the professional music industry. On Aug. 26, the president of the Colorado-based tech company MakeMusic announced that the firm would be making "no further updates" to Finale, the pioneering and popular music-notation app that the firm had been selling and updating for 35 years. "Technology stacks change, Mac and Windows operating systems evolve, and Finale's millions of lines of code add up," MakeMusic's Greg Dell'Era wrote in his first (and likely last) contribution to the company's Finale-centric blog. "Instead of releasing new versions of Finale that would offer only marginal value to our users, we've made the decision to end its development." In other words: A key computer program for digitizing and expediting the arduous process of writing and formatting the types of sheet music used by musicians and ensembles everywhere--orchestras, schoolkids, the theater world, session instrumentalists, pop producers--would be phased out by the following year, with no hopes for revival.
Towards Robust and Truly Large-Scale Audio-Sheet Music Retrieval
Carvalho, Luis, Widmer, Gerhard
A range of applications of multi-modal music information retrieval is centred around the problem of connecting large collections of sheet music (images) to corresponding audio recordings, that is, identifying pairs of audio and score excerpts that refer to the same musical content. One of the typical and most recent approaches to this task employs cross-modal deep learning architectures to learn joint embedding spaces that link the two distinct modalities - audio and sheet music images. While there has been steady improvement on this front over the past years, a number of open problems still prevent large-scale employment of this methodology. In this article we attempt to provide an insightful examination of the current developments on audio-sheet music retrieval via deep learning methods. We first identify a set of main challenges on the road towards robust and large-scale cross-modal music retrieval in real scenarios. We then highlight the steps we have taken so far to address some of these challenges, documenting step-by-step improvement along several dimensions. We conclude by analysing the remaining challenges and present ideas for solving these, in order to pave the way to a unified and robust methodology for cross-modal music retrieval.